%matplotlib inline
Due date : 2022-01-30 @23h55 (this is a hard deadline)
Write in English or French
xxx.ipynb file (jupyter notebook) or a xxx.py file (if you are using jupytext) built by completing the template. We won't execute the code in your notebook: all your results, displays and plots must be visible without having to rerun everything.Once again, follow the steps described here : course homepage. If you don't: no evaluation!
Here is the way we'll assess your work
| Criterion | Points | Details |
|---|---|---|
| Spelling and syntax | 3 | English/French |
| Plots correction | 3 | Clarity / answers the question |
| Plot style and cleanliness | 3 | Titles, legends, labels, breaks ... |
| Table wrangling | 4 | ETL, SQL like manipulations |
| Computing Statistics | 5 | SQL goup by and aggregation |
| DRY compliance | 2 | DRY principle at Wikipedia |
If we see a single (or more) for loop in your code: -5 points. Everything can be done using high-level pandas methods
Base Python can do a lot. But the full power of Python comes from a fast growing collection of packages/modules.
Packages are first installed (that is using pip install or conda install), and if
needed, imported during a session.
The docker image you are supposed to use already offers a lot of packages. You should not need to install new packages.
Once a package has been installed on your drive, if you want all objects exported by the package to be available in your session, you should import the package, using from pkg import *.
If you just want to pick some sobjects from the package,
you can use qualified names like pkg.object_name to access the object (function, dataset, class...)
# importing basic tools
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
import os # file operations
import requests # networking
from datetime import date # if needed
# importing plotting packages
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
# make pandas plotly-friendly
np.set_printoptions(precision=2, suppress=True)
%matplotlib inline
pd.options.plotting.backend = "plotly"
The data are built and made available by INSEE (French Governement Statistics Institute)
Prénoms:
This dataset has been growing for a while. It has been considered by social scientists for decades. Given names are meant to give insights into a variety of phenomena, including religious observance.
A glimpse at the body of work can be found in L'archipel français by Jérome Fourquet, Le Seuil, 2019
Read the File documentation
QUESTION: Download the data into a file which relative path is './nat2020_csv.zip'
Hints:
requests.params = dict(
url = 'https://www.insee.fr/fr/statistiques/fichier/2540004/nat2020_csv.zip',
dirpath = './',
timecourse = '',
datafile = 'nat2020.hdf',
filename = 'nat2020_csv.zip')
# modify location make sure you are in the right directory
# %cd
# %pwd #
# %ls
# %mkdir # if needed
url = params['url'] # 'https://www.insee.fr/fr/statistiques/fichier/2540004/nat2020_csv.zip'
filename = params['filename'] # 'nat2020_csv.zip'
# your code here
if os.path.exists(os.path.join(params['dirpath'], filename)):
print(f"The file {os.path.join(params['dirpath'], filename)} already exists.")
else:
r = requests.get(url)
with open(os.path.join(params['dirpath'], filename), 'wb') as f:
f.write(r.content)
print(f"Downloaded file {os.path.join(params['dirpath'], filename)}")
### On a utilisé le même code que dans le notebook01_python, Partie I/O
The file ./nat2020_csv.zip already exists.
QUESTION: Load the data in a pandas DataFrame called data
Hints:
Pandas dataframe with 4 columns.csv file.pandas provides the convenient tools.sample() from class DataFrame can be helpful.# your code here
df = pd.read_csv(filename, sep = ";", compression='zip')
df.shape
(667364, 4)
print(df.head(n=3),'\n\n',df.tail(n=3))
sexe preusuel annais nombre
0 1 _PRENOMS_RARES 1900 1250
1 1 _PRENOMS_RARES 1901 1342
2 1 _PRENOMS_RARES 1902 1330
sexe preusuel annais nombre
667361 2 ZYNEB 2019 7
667362 2 ZYNEB 2020 8
667363 2 ZYNEB XXXX 19
QUESTION: Look at the data, Use the attributes columns, dtypes and the methods head, describe, to get a feeling of the data.
This dataset is supposed to report all given names used for either sex during a year in France since 1900
The file is made of 652 056 lines and 4 columns.
|-- preusuel : object
|-- nombre: int64
|-- sexe: int64
|-- annais: object
Each row indicates for a given preusuel (prénom usuel, given name), sexe (sex), and annais (année naissance, birthyear) the nombre (number) of babies of the given sex who were given that name during the given year.
| sexe | preusuel | annais | nombre |
|---|---|---|---|
| 2 | SYLVETTE | 1953 | 577 |
| 1 | BOUBOU | 1979 | 4 |
| 1 | NILS | 1959 | 3 |
| 2 | NICOLE | 2003 | 36 |
| 1 | JOSÉLITO | 2013 | 4 |
QUESTION: Compare memory usage and disk space used by data
Hints:
info prints a concise summary of a DataFrame.memory_usage, you can get an estimate
of the amount of memory used by the DataFrame.disk = os.path.getsize(os.path.join(params['dirpath'], filename))/1024/1024
print("Disk space used : {} MB".format(disk))
### for disk space
Disk space used : 2.3307418823242188 MB
# Méthode 1: memory_usage - renvoie en Bytes (B) - /1024/1024 -> MegaBytes (MB)
mem = df.memory_usage(deep=True).sum()/1024/1024
print("Méthode 1 - memory_usage : {} MB\n".format(mem))
# Méthode 2: info - renvoie en MB
print("Méthode 2 - info : "); df.info(memory_usage='deep')
### for memory usage
Méthode 1 - memory_usage : 90.65318870544434 MB Méthode 2 - info : <class 'pandas.core.frame.DataFrame'> RangeIndex: 667364 entries, 0 to 667363 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sexe 667364 non-null int64 1 preusuel 667362 non-null object 2 annais 667364 non-null object 3 nombre 667364 non-null int64 dtypes: int64(2), object(2) memory usage: 90.7 MB
# your code here
print("Memory usage ({} MB) >>> Disk Space ({} MB) by {} MB (x{} greater)"\
.format(round(mem,3),
round(disk,3),
round(mem-disk,3),
round(mem/disk)))
Memory usage (90.653 MB) >>> Disk Space (2.331 MB) by 88.322 MB (x39 greater)
QUESTION: Display the output of .describe() with style.
# your code here
(
df.groupby('sexe')
.describe()
.style
.format("{:.3}")
.background_gradient(cmap='Blues')
)
### sexe est un input catégorique (bien que de dtype int). On peut groupby avec
| nombre | ||||||||
|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | |
| sexe | ||||||||
| 1 | 3.04e+05 | 1.43e+02 | 9.72e+02 | 1.0 | 4.0 | 8.0 | 25.0 | 5.36e+04 |
| 2 | 3.63e+05 | 1.18e+02 | 8.05e+02 | 1.0 | 4.0 | 8.0 | 25.0 | 5.22e+04 |
QUESTION: For each column compute the number of distinct values
# your code here
df.nunique(axis=0)
sexe 2 preusuel 35010 annais 122 nombre 7304 dtype: int64
QUESTION: Make sexe a category with two levels Female and Male. Call the new column genre. Do you see any reason why this factor should be ordered?
Hint: Read Pandas and categorical variables
# your code here
def genre(n):
return 'Male' if n==1 else 'Female'
df['genre'] = df['sexe'].apply(lambda x: genre(x))
df['genre'] = df['genre'].astype('category')
print(df.head(n=3),'\n\n',df.tail(n=3))
### il n'y a pas de raison à donner un ordre entre Male et Female
sexe preusuel annais nombre genre
0 1 _PRENOMS_RARES 1900 1250 Male
1 1 _PRENOMS_RARES 1901 1342 Male
2 1 _PRENOMS_RARES 1902 1330 Male
sexe preusuel annais nombre genre
667361 2 ZYNEB 2019 7 Female
667362 2 ZYNEB 2020 8 Female
667363 2 ZYNEB XXXX 19 Female
QUESTION: Compare memory usage of columns sexe and genre
# your code here
mem_sexe, mem_genre = df[['sexe','genre']].memory_usage(deep=True,index=False)/1024/1024
print("Memory usage of sexe ({} MB) > Memory usage of genre ({} MB) by {} MB (x{} greater)"\
.format(round(mem_sexe,3),
round(mem_genre,3),
round(mem_sexe-mem_genre,3),
round(mem_sexe/mem_genre)))
Memory usage of sexe (5.092 MB) > Memory usage of genre (0.637 MB) by 4.455 MB (x8 greater)
QUESTION: Would it be more memory-efficient to recode sexe using modalities F and M instead of Male and Female ?
Insert your answer here
Oui, cela nous ferait gagner un peu de mémoire en plus car moins de lettres à encoder donc moins de bytes utilisés
QUESTION: Variable annais class is object. Make annais of type float. Note that missing years are encoded as "XXXX", find a way to deal with that.
Hint: As of releasing this Homework (2021-01-22), Pandas is not very good at managing missing values,
see roadmap. Don't try to convert annais into an integer column.
# your code here
def annais(n):
return np.nan if n=="XXXX" else n
df['annais'] = df['annais'].apply(lambda x: annais(x))
df['annais'] = df['annais'].astype('float')
df['annais']
0 1900.0
1 1901.0
2 1902.0
3 1903.0
4 1904.0
...
667359 2017.0
667360 2018.0
667361 2019.0
667362 2020.0
667363 NaN
Name: annais, Length: 667364, dtype: float64
QUESTION: Remove useless columns (now that you've created new ones, and rename them). You should end up with a dataframe with columns called "gender", "year", "count", "firstname" with the following dtypes:
gender category
firstname object
count int64
year float64
# your code here
df = df[df.columns[1:]] # cols = ['preusuel', 'annais', 'nombre', 'genre']
cols = df.columns; df.rename(columns={cols[0]:"firstname",
cols[1]:"year",
cols[2]:"count",
cols[3]:"gender"},inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 667364 entries, 0 to 667363 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 firstname 667362 non-null object 1 year 630689 non-null float64 2 count 667364 non-null int64 3 gender 667364 non-null category dtypes: category(1), float64(1), int64(1), object(1) memory usage: 15.9+ MB
QUESTION: How many missing values (NA) have been introduced? How many births are concerned?
# your code here
nan_per_column = np.sum(df.isnull())
nan_total = np.sum(nan_per_column)
print("Il y a {} années avec des missing values. Il y en a {} au total"\
.format(nan_per_column['year'],nan_total))
nan_per_column
Il y a 36675 années avec des missing values. Il y en a 36677 au total
firstname 2 year 36675 count 0 gender 0 dtype: int64
QUESTION: Read the documentation and describe the origin of rows containing the missing values.
Insert your answer here
Le dataset est issu de bulletins de naissance qui sont établis à la base à partir de la déclaration des parents. C'est pourquoi on peut lire dans la documentation "L'Insee ne peut garantir que le fichier des prénoms soit exempt d'omissions ou d'erreurs". C'est très probablement un problème de déclarations incomplètes
QUESTION: Save the transformed dataframe (retyped and renamed) to ./nat2020_csv.zip. Try several compression methods.
# your code here
compression_opts = dict(method='zip',archive_name='nat2020-modif.csv')
df.to_csv('nat2020.zip', index=False, compression=compression_opts)
### il faut sauver dans un nouveau zip non ?
### (assez confus d'après le markdown en dessous de la cellule [22]
### -> pas de nat2020_csv.zip mais nat2020.zip.
### -> faut-il delete le 1er ? nat2020.zip est-il en réalité nat2020_csv.zip ?
### -> je crée un nouveau zip au cas où, je ne prends pas de risque)
QUESTION: Save the transformed dataframe (retyped and renamed) to ./nat2020.hdf using .hdf format
# your code here
df.to_hdf("nat2020.hdf", key='df', format='table') # c'est assez lourd quand même !
At that point your working directory should look like:
├── homework01.py # if you use `jupytext`
├── homework01.ipynb
├── nat2020.hdf
├── nat2020.zip
QUESTION: Reload the data using read_hdf(...) so that the resulting dataframe is properly typed with meaningful column names.
Hint: use try: ... except to handle exceptions such as FileNotFoundError
# your code here
pd.read_hdf("nat2020.hdf")
| firstname | year | count | gender | |
|---|---|---|---|---|
| 0 | _PRENOMS_RARES | 1900.0 | 1250 | Male |
| 1 | _PRENOMS_RARES | 1901.0 | 1342 | Male |
| 2 | _PRENOMS_RARES | 1902.0 | 1330 | Male |
| 3 | _PRENOMS_RARES | 1903.0 | 1286 | Male |
| 4 | _PRENOMS_RARES | 1904.0 | 1430 | Male |
| ... | ... | ... | ... | ... |
| 667359 | ZYNEB | 2017.0 | 6 | Female |
| 667360 | ZYNEB | 2018.0 | 5 | Female |
| 667361 | ZYNEB | 2019.0 | 7 | Female |
| 667362 | ZYNEB | 2020.0 | 8 | Female |
| 667363 | ZYNEB | NaN | 19 | Female |
667364 rows × 4 columns
QUESTION: For each year, compute the total number of Female and Male births and the proportion of Female births among total births
Hints:
Groupby operations using several columns for the groups return a dataframe with a MultiIndex index see Pandas advanced
Have a look at MultiIndex, reset_index, pivot, columns.droplevel
# your code here
gb1 = df.groupby(["year", "gender"]).agg({'count':'sum'})
gb1['%'] = 100*gb1['count']/gb1.groupby(level=0)['count'].sum()
gb1
| count | % | ||
|---|---|---|---|
| year | gender | ||
| 1900.0 | Female | 237653 | 57.260126 |
| Male | 177388 | 42.739874 | |
| 1901.0 | Female | 257492 | 56.784341 |
| Male | 195964 | 43.215659 | |
| 1902.0 | Female | 261437 | 56.127654 |
| ... | ... | ... | ... |
| 2018.0 | Male | 377802 | 51.313527 |
| 2019.0 | Female | 356915 | 48.817768 |
| Male | 374202 | 51.182232 | |
| 2020.0 | Female | 348024 | 48.792538 |
| Male | 365249 | 51.207462 |
242 rows × 2 columns
QUESTION: Plot the proportion of female births as a function of year. Don't forget: title, axes labels, ticks, scales, etc.
Because of what we did before, the plot method of a DataFrame with be rendered using plotly, so you can use this. But you can use also seaborn or any other available plotting library that you want.
Hint: Mind the missing values in the year column
# your code here
gb1.unstack(1)['%'].reset_index().plot('year','Female')
QUESTION: Make any sensible comment about this plot.
Insert your answer here
Une décroissance presque exponentielle de la proportion de naissances de filles entre 1900 et 1940 environ, passant de 57 à 50%. Il se stabilise vers les années 50 autout des 49%, proportion qui est toujours la même aujourd'hui. Passage from "plus de naissances de filles que de garçons" to "naissances équilibrées"
QUESTION: Explore the fluctuations of sex ratio around its mean value since 1945. Plot deviations of sex ratio around its mean since 1945 as a function of time.
# your code here
mean_sex_f = gb1.unstack(1).loc[1945:,('%','Female')].mean()
__ = (gb1.unstack(1)['%']-mean_sex_f).reset_index().plot.bar('year','Female')
__.add_hline(y=0, line_width=3, line_dash="dash", line_color="green", annotation_text="mean_since_1945", annotation_position = "top right")
QUESTION: For each firstname and sex (some names may be given to girls and boys), compute the total number of times this firstname has been given during 1900-2019. Print the top 20 firstnames given and style your result dataframe using background_gradient for instance.
# your code here
(
df.loc[(df.year>1899) & (df.year<2020)].groupby(['firstname','gender'])
.agg({'count':'sum'})
.sort_values('count',ascending=False)
.head(20)
.style.background_gradient(cmap='Blues')
)
| count | ||
|---|---|---|
| firstname | gender | |
| MARIE | Female | 2231586 |
| JEAN | Male | 1913439 |
| PIERRE | Male | 891177 |
| _PRENOMS_RARES | Female | 826278 |
| MICHEL | Male | 820288 |
| _PRENOMS_RARES | Male | 773079 |
| ANDRÉ | Male | 711845 |
| JEANNE | Female | 557620 |
| PHILIPPE | Male | 538427 |
| LOUIS | Male | 521436 |
| RENÉ | Male | 516225 |
| ALAIN | Male | 506888 |
| JACQUES | Male | 482701 |
| BERNARD | Male | 469254 |
| MARCEL | Male | 468120 |
| DANIEL | Male | 435278 |
| ROGER | Male | 423583 |
| ROBERT | Male | 419096 |
| PAUL | Male | 419012 |
| CLAUDE | Male | 411726 |
QUESTION: For each sex, plot the proportion of births given _PRENOMS_RARES as a function of the year.
gb2 = (
df.loc[df.firstname=='_PRENOMS_RARES']
.groupby(['gender','year'])
.agg({'count':'sum'})
)
gb2['%'] = 100*gb2['count']/gb2.groupby(level=1)['count'].sum()
# your code here
gb2.unstack(0)['%'].reset_index().plot("year",['Female','Male'])
QUESTION: Plot the proportion of female births given name 'MARIE' or 'MARIE-...' (compounded names) as a function of the year.
Proceed in such a way that the reader can see the share of compounded names. We are expecting an area plot.
Hints:
.str accessor (to apply a string method over a whole column containing string)See Graphique 3, page 48, de L'archipel français de J. Fourquet. Le Seuil. Essais. Vol. 898.
Add annotation, 1st World War, Front Populaire, 2nd World War, 1968
import regex as re
pattern="(^MARIE)(-|$)([\w]*)"
marie = df.iloc[df["firstname"].str.extract(pattern).dropna().index,:].copy()
marie
| firstname | year | count | gender | |
|---|---|---|---|---|
| 189730 | MARIE | 1900.0 | 1065 | Male |
| 189731 | MARIE | 1901.0 | 1048 | Male |
| 189732 | MARIE | 1902.0 | 1066 | Male |
| 189733 | MARIE | 1903.0 | 1054 | Male |
| 189734 | MARIE | 1904.0 | 974 | Male |
| ... | ... | ... | ... | ... |
| 538944 | MARIE-ZOÉ | 1993.0 | 3 | Female |
| 538945 | MARIE-ZOÉ | 1998.0 | 3 | Female |
| 538946 | MARIE-ZOÉ | 2003.0 | 3 | Female |
| 538947 | MARIE-ZOÉ | 2004.0 | 3 | Female |
| 538948 | MARIE-ZOÉ | NaN | 34 | Female |
12327 rows × 4 columns
def marie_comp(x):
if x=='MARIE' : return 0
elif x.startswith('MARIE-') : return 1
marie['marie_comp'] = marie.firstname.apply(lambda x: marie_comp(x))
marie = marie.loc[marie.gender=='Female']
marie
| firstname | year | count | gender | marie_comp | |
|---|---|---|---|---|---|
| 526858 | MARIE | 1900.0 | 48713 | Female | 0 |
| 526859 | MARIE | 1901.0 | 52150 | Female | 0 |
| 526860 | MARIE | 1902.0 | 51857 | Female | 0 |
| 526861 | MARIE | 1903.0 | 50424 | Female | 0 |
| 526862 | MARIE | 1904.0 | 50131 | Female | 0 |
| ... | ... | ... | ... | ... | ... |
| 538944 | MARIE-ZOÉ | 1993.0 | 3 | Female | 1 |
| 538945 | MARIE-ZOÉ | 1998.0 | 3 | Female | 1 |
| 538946 | MARIE-ZOÉ | 2003.0 | 3 | Female | 1 |
| 538947 | MARIE-ZOÉ | 2004.0 | 3 | Female | 1 |
| 538948 | MARIE-ZOÉ | NaN | 34 | Female | 1 |
12091 rows × 5 columns
mg = marie.groupby(['year','marie_comp']).agg({('count','sum')}).loc[:,'count'] #242 rows
mg['% of f_births'] = 100*mg['count']/gb1['count'].unstack(1).loc[:,'Female']
mg
<ipython-input-32-7737463d3938>:1: FutureWarning: ['gender'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
| count | % of f_births | ||
|---|---|---|---|
| year | marie_comp | ||
| 1900.0 | 0 | 48713 | 20.497532 |
| 1 | 2194 | 0.923195 | |
| 1901.0 | 0 | 52150 | 20.253056 |
| 1 | 2255 | 0.875755 | |
| 1902.0 | 0 | 51857 | 19.835371 |
| ... | ... | ... | ... |
| 2018.0 | 1 | 173 | 0.048262 |
| 2019.0 | 0 | 747 | 0.209294 |
| 1 | 155 | 0.043428 | |
| 2020.0 | 0 | 652 | 0.187343 |
| 1 | 139 | 0.039940 |
242 rows × 2 columns
# your code here
_ = mg.unstack(1)['% of f_births'].plot.area()
_.add_vrect(x0=1914, x1=1918, fillcolor="black", opacity=0.25, line_width=0, annotation_text="WW1", annotation_font_size=12, annotation_font_color='red')
_.add_vrect(x0=1939, x1=1945, fillcolor="black", opacity=0.25, line_width=0, annotation_text="WW2", annotation_font_size=12, annotation_font_color='red')
_.add_vrect(x0=1936, x1=1938, fillcolor="green", opacity=0.25, line_width=0, annotation_text="f_populaire", annotation_font_size=10, annotation_font_color='black')
_.add_vline(x=1968, line_width=3, line_dash="dash", line_color="green", annotation_text="1968", annotation_position = "top left")
QUESTION: For each sex, select the ten most popular names in year 2000, and plot the proportion of newborns given that name over time. Take into account that some names might have zero occurrence during certain years.
Hint: Leave aside the rows with '_PRENOMS_RARES'.
gender_gb = (
df.loc[(df.year==2000)&(df.firstname!='_PRENOMS_RARES')]
.groupby(['gender','firstname'])
.agg({'count':'sum'})
)
f_10 = gender_gb.loc['Female',:].sort_values('count',ascending=False).head(10).index
m_10 = gender_gb.loc['Male',:].sort_values('count',ascending=False).head(10).index
_m = df[(df.gender=='Male')&(df.firstname.isin(m_10))].groupby(['year','firstname']).agg({'count':'sum'})
_f = df[(df.gender=='Female')&(df.firstname.isin(f_10))].groupby(['year','firstname']).agg({'count':'sum'})
rm = 100*_m['count']/df.loc[(df.firstname!='_PRENOMS_RARES')].groupby(['year','gender']).agg({'count':'sum'}).unstack(0).loc['Male','count']
rf = 100*_f['count']/df.loc[(df.firstname!='_PRENOMS_RARES')].groupby(['year','gender']).agg({'count':'sum'}).unstack(0).loc['Female','count']
# your code here
rm.unstack(1).plot()
rf.unstack(1).plot()
plotly for animated plots
The US babynames analogue of the INSEE file has been a playground for data scientists,
see https://github.com/hadley/babynames
Don't Repeat Yourself (DRY) principle at Wikipedia